Dmoe integration #1210
base: main
Conversation
- Removed mp assertion for MoE
- Removed mlp_type checks in MoE code
- Added bf16 conversion to dmoe_gather
megatron/mpu/mappings.py (Outdated)
@@ -185,9 +180,102 @@ def _dmoe_gather(input_: torch.Tensor, tokens_per_expert: torch.Tensor):
    # Note: torch.cat already creates a contiguous tensor.
    output = torch.cat(tensor_list, dim=gather_dim)

    # Bf16 convert
Removing this results in fp32 output.
This was resolved in the latest commit.
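For reference, here is a minimal sketch of the dtype-preserving behavior being discussed, assuming the gather path may upcast shards to fp32 along the way; the helper name and signature are illustrative, not the actual _dmoe_gather code.

```python
import torch

def gather_preserving_dtype(tensor_list, input_dtype, gather_dim=0):
    """Concatenate gathered shards and cast back to the caller's dtype.

    Hypothetical helper: if the gathered shards were upcast (e.g. to fp32)
    somewhere in the gather path, convert the result back so bf16
    activations stay bf16 downstream.
    """
    # torch.cat already creates a contiguous tensor.
    output = torch.cat(tensor_list, dim=gather_dim)
    if output.dtype != input_dtype:
        output = output.to(dtype=input_dtype)
    return output


# Example: fp32 shards gathered on behalf of a bf16 caller come back as bf16.
shards = [torch.randn(4, 8), torch.randn(6, 8)]
out = gather_preserving_dtype(shards, torch.bfloat16)
assert out.dtype == torch.bfloat16 and out.shape == (10, 8)
```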
Profiles before and after the merge:
@@ -67,35 +58,26 @@

   # regularization
   "gradient_clipping": 1.0,
-  "weight_decay": 0.1,
+  "weight_decay": 0.0,
There appear to be a lot of extraneous config changes. Any reason why?
@@ -1075,15 +1077,6 @@ def calculate_derived(self):
    # if we set pipe_parallel_size to 0, GPT2ModelPipe.to_sequential() is called, and we run training with
    # the sequential model without the PipelineModule wrapper to avoid the overhead it incurs
    self.update_value("is_pipe_parallel", self.pipe_parallel_size >= 1)
-   if self.moe_num_experts > 1:
Did we test these parallelism combinations?
Supersedes #1197
This PR adds dropless MoE support using the Grouped GEMM implementation in megablocks.
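To make the "dropless" part concrete, here is a rough pure-PyTorch sketch of the token path: every token is routed (there is no capacity factor, so nothing is dropped), tokens are permuted so each expert's tokens are contiguous, and the per-expert matmuls in the loop below are what megablocks' grouped GEMM fuses into a single kernel. All names and shapes are illustrative, not the PR's actual code.

```python
import torch

def dropless_moe_ffn(x, router_logits, w1, w2):
    """Illustrative dropless MoE forward pass (top-1 routing, no capacity factor).

    x:             (tokens, hidden)
    router_logits: (tokens, num_experts)
    w1:            (num_experts, hidden, ffn)
    w2:            (num_experts, ffn, hidden)
    """
    num_experts = router_logits.shape[-1]
    probs, expert_idx = router_logits.softmax(dim=-1).max(dim=-1)

    # Sort tokens by their assigned expert; no token is ever dropped.
    order = torch.argsort(expert_idx)
    x_sorted = x[order]
    tokens_per_expert = torch.bincount(expert_idx, minlength=num_experts)

    # Per-expert GEMMs over contiguous token blocks. This loop is what a
    # grouped GEMM kernel (megablocks / grouped_gemm) performs in one call.
    out_sorted = torch.empty_like(x_sorted)
    start = 0
    for e in range(num_experts):
        end = start + int(tokens_per_expert[e])
        h = torch.relu(x_sorted[start:end] @ w1[e])
        out_sorted[start:end] = h @ w2[e]
        start = end

    # Undo the permutation and scale by the router probability.
    out = torch.empty_like(out_sorted)
    out[order] = out_sorted
    return out * probs.unsqueeze(-1)


# Tiny smoke test with random weights.
T, H, F, E = 16, 32, 64, 4
x = torch.randn(T, H)
logits = torch.randn(T, E)
w1, w2 = torch.randn(E, H, F), torch.randn(E, F, H)
y = dropless_moe_ffn(x, logits, w1, w2)
assert y.shape == (T, H)
```

In the real path the loop becomes a single grouped GEMM call, which is why a tokens_per_expert tensor travels alongside the permuted activations (as in the _dmoe_gather signature shown above).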
Features
- Unlike the legacy DeepSpeed MoE implementation, which uses the data parallel groups for expert parallelism, this implementation uses the model parallel group to parallelize the experts, avoiding several problems with the legacy approach.
- Clarified arguments by separating MoE args into their own class.
- Sinkhorn routing is used by default during training and supports k >= 1; top-k routing is used for evaluation/inference (see the routing sketch after this list).
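A rough sketch of that routing split, assuming a plain Sinkhorn-Knopp normalization of the router scores during training and ordinary top-k of the softmax at evaluation time; the exact gate computation in the PR may differ.

```python
import torch

def sinkhorn_balance(logits: torch.Tensor, n_iters: int = 10) -> torch.Tensor:
    """Sinkhorn-Knopp normalization of router scores.

    Alternately normalizing expert (column) and token (row) sums pushes the
    assignment matrix toward roughly uniform expert load, which is why it is
    used for routing during training.
    """
    cost = torch.exp(logits.float())
    for _ in range(n_iters):
        cost = cost / (cost.sum(dim=0, keepdim=True) + 1e-8)  # balance expert load
        cost = cost / (cost.sum(dim=1, keepdim=True) + 1e-8)  # normalize per token
    return cost


def route(logits: torch.Tensor, k: int, training: bool):
    """Pick k experts per token: Sinkhorn-balanced scores choose the experts
    during training, plain softmax top-k at evaluation/inference."""
    if training:
        balanced = sinkhorn_balance(logits)
        _, expert_idx = torch.topk(balanced, k, dim=-1)
        # Gate values here come from the ordinary softmax; Sinkhorn only
        # decides *which* experts are selected (one common variant).
        gates = torch.gather(logits.softmax(dim=-1), -1, expert_idx)
    else:
        gates, expert_idx = torch.topk(logits.softmax(dim=-1), k, dim=-1)
    return gates, expert_idx


# Example: route 8 tokens to 2 of 4 experts.
gates, idx = route(torch.randn(8, 4), k=2, training=True)
assert gates.shape == idx.shape == (8, 2)
```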
Testing
Tested pipeline parallel sizes [3, 2, 1] and model parallel sizes [1, 2, 4, 8] on Ampere GPUs.
Notes
Added megablocks and grouped_gemm to the dependencies. It might be desirable to pull some of the kernels in directly, as NVIDIA megatron-core does.